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Learning in restricted Boltzmann machine is typically hard due to the computation of gradients 
of log-likelihood function. To describe the network state statistics of the restricted Boltzmann 
machine, we develop an advanced mean field theory based on the Bethe approximation. Our theory 
provides an efficient message passing based method that evaluates not only the partition function 
(free energy) but also its gradients without requiring statistical sampling. The results are compared 
with those obtained by the computationally expensive sampling based method. 
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Restricted Boltzmann machine (^RBM) forms building 
blocks of a deep belief network [l|, [^, which is able to 
learn complex internal representations of structured ob¬ 
jects (such as nature image, speech, or hand writing). 
RBM also has wide applications in computational biology 
problem, for example, modeling high-dimensional neural 
activity data from cortical microcolumns Q. 

However, learning in RBM is computationally hard, 
since gradients of the log-likelihood function needs to be 
computed at every iteration step to update the model 
parameters. This computation is usually accomplished 
by Gibbs-sampling-based method or its variants Hi, 
for which the tradeoff between accuracy and convergence 
speed requires careful considerations. Furthermore, an 
efficient way to evaluate the partition function (e.g., log- 
likelihood function for cross-validation analysis) remains 
unknown. 

Here, we develop a mean field theory for the RBM 
based on the cavity method (Bethe approximation) Q, 
which yields an efficient and fully-distributed algorithm 
to evaluate the free-energy (partition function) of a RBM 
of interest. The remarkable efficiency is confirmed by 
comparing the computation results of gradients of log- 
likelihood function by Gibbs sampling and the proposed 
mean field theory. 

A RBM 0, 0 consists of one hidden layer and one 
visible layer without lateral connections between nodes 
in each layer. We assume the hidden layer has M nodes, 
while the visible layer has N nodes. Hidden node a with 
external field ha is connected to visible node j with field 
(jij by a symmetric coupling Waj ■ The energy function for 
RBM is thus defined by E = - '^i,a ^i^aiSa - Yl,i - 
^aSa, where Gi and Sa are used to specify the state of 
visible node i and hidden node a, respectively. Due to the 
conditional independence of hidden nodes’ state given cr, 
the state of the hidden nodes can be marginalized. This 
leads to the following probability of a visible state: 

P(cr) = 2 [2 cosh{Wa ■ cr -f /iq)] , (1) 

a i 

where Wa denotes the a-th row vector of the coupling 
matrix w. Z is a normalization constant (also called the 
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FIG. 1: (Color online) A RBM is composed of one hidden 
layer and one visible layer. No lateral connections exist within 
both hidden and visible layers. Each hidden node is connected 
to all visible nodes with symmetric coupling weights, and is 
responsible for capturing high order dependence. The origi¬ 
nal RBM is shown in the left panel as an example of three 
hidden nodes (solid circles) and five visible nodes (empty cir¬ 
cles). The right panel shows a transformed factor graph after 
marginalization of hidden states for theoretical analysis. Each 
factor node (square node) represents the probabilistic normal¬ 
ization of a hidden node given the state of all visible nodes 
(see the main text). 


partition function) of the model. As a model study, we 
assume the element of the matrix w is independently and 
identically distributed with a normal distribution with 
mean zero and variance g/N. We assume that the exter¬ 
nal field for both layers follows a normal distribution with 
mean zero and variance v. We denote the ratio between 
the number of hidden nodes and that of visible nodes by 
a = M/N, where M and N can be arbitrarily large. A 
schematic representation of a RBM {M = 3, A^ = 5) is 
shown in Fig. [T] 

An exact computation of Z requires an exponential 
computational complexity (2^), which becomes impos¬ 
sible for a relatively large N. However, advanced mean 
field approximation can be used to compute approximate 
values under certain condition, and its prediction should 
be compared with numerical simulations. Here, we pro¬ 
pose the Bethe approximation @ to tackle this prob¬ 
lem. First, we transform the original model (left panel of 
Fig.p into a factor graph (right panel of Fig. m i, 
where each square node indicates a Boltzmann factor 
2 cosh{wa -cr-y ha) in Eq. ([T]). Then, we introduce the cav¬ 
ity probability Pi^a(o'i) that the visible node i takes state 
Gi in the absence of the contribution from factor node 
a [13 , and Pi^aio'i) satisfies the following self-consistent 
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equations: 

Pi^a{<Ti) oc /Zh_>j(cri), (2a) 

h^di\a 

^ ^ 2 cosh ^ 1/^5 ’ ^ H” ^fc) n 

{CTj|je9h\i} j£db\i 

(2b) 

where the symbol oc indicates a normalization constant, 
di\a denotes the neighbors of node i except factor node 
a, db\i denotes the neighbors of factor node b except 
visible node i, and the auxiliary quantity repre¬ 

sents the contribution from factor node b to visible node 
i given the value of di 0- With these definitions, the 
products in Eq. @ are reasonable under the weak cor¬ 
relations assumption, whereas, the validity of this Bethe 
approximation should be checked by a stability analysis. 

Note that the computation in Eq. (Ei) is still in¬ 
tractable due to the summation over all possible cr ex¬ 
cept di. However, because Ub-^i = 12j^db\i'^bjO'j is 
a sum of a large number of nearly independent ran¬ 
dom variables, the central-limit theorem implies that 
the distribution of Ub^i is well characterized by its 
mean and variance 0, he., Gb^i — '^j^gb\i^bj'm,j^b 
and ~ “ ^‘j^b) respectively, where 

rrij^b = '^jG’j^bio'j) denotes the cavity magnetiza¬ 
tion (the average of dj in the absence of factor node b). 

Because we consider the binary spin variable d^ = ±1, 
Pi^a{<yi) and ^b^iipi) can be parametrized by rrii^a 
and cavity bias Ub^i, respectively. Ub^i is defined as 
k practical recursive equations, the 

so-called message passing equations, are thus derived as: 


mi_>a = tanh ^ Ub^i | , 

y b^di\a 

_ 1, cosh(/tj, -I- Gb^i + Wbi) 
2 ^ cosh{hb + Gb^i - Wbi)'' 


(3a) 

(3b) 


where the 'E.b^i dependency in Eq. (I3bl) drops because of 
the symmetry of cosh. The cavity magnetization can be 
understood as the message passing from the visible node 
to the factor node, while the cavity bias is interpreted as 
the message passing from the factor node to the visible 
node. This message passing based computation is much 
more accurate than naive mean field approximation [0 , 
which assumes a fully factorized distribution for Eq. O- 
In contrast, Eq. ([3]) captures nearest neighbors’ correla¬ 
tions. 

Once the iteration of Eq. converges, the free energy 
of the model can be computed from the fixed-point so¬ 
lution. Under the Bethe approximation, the Bethe free 


energy is expressed as [10|, lll| : 


F = -J2^nZ, + iN-l)J2^nZa, 


( 4 ) 


where Z, = e'^- Obea*+ e Ilhea* 
in which ^b^i{o'i) = 2e“>>->*/^ cosh(/i;, -|- Gb^i + WbiUi)- 
Za = 2e““/^ cosh(/ia -I- Go)- Ga and 5^ are given by 
Ejeoa Wajrrij^a and respectively. 

Each stable solution of the message passing algorithm of 
Eq. ([3|) corresponds to a local minimum of the free energy 
function in Eq. O 0- 

RBM defined in Fig.[T]is basically a densely-connected 
graphical model. Our mean field theory provides a prac¬ 
tical way to estimate the free energy of single instances 
(typical examples of the model). More precisely, we ini¬ 
tialize the cavity magnetization and bias on each link 
of the factor graph by random values, and then iterate 
Eq. ([3]) until it converges within a prescribed accuracy. 
Note that the overall time complexity is of the order 
0{N'^), furthermore, the algorithm is fully-distributed, 
and thus amenable to large-scale applications. 

In the remaining part, we demonstrate the computa¬ 
tion of the free energy on large single instances by ap¬ 
plying the message passing algorithm, and confirm the 
accuracy of the results by comparing gradients of the 
log-likelihood with those obtained by the Gibbs sampling 
method. A stability analysis of the message passing al¬ 
gorithm is also presented. 

We run the message passing equations on single in¬ 
stances of RBMs as size N, hidden-node density a, and 
coupling strength g are varied. As displayed in Fig. [2] 

(a) , the free energy density decreases as a increases. Note 
that the density does not change significantly at two large 
sizes {N = 100 and N = 1000). Furthermore, the in¬ 
set of Fig. [2] (a) shows that the theoretical result even 
matches well with the exact enumeration result for small 
size N = 20. As the variance parameter g of weights 
increases, the free energy density also decreases (Fig. E] 

(b) ). The same property also holds when the variance v 
of external field increases (the inset of Fig. [2] (b)). In the 
explored range of g (or v) and a, Eq. m converges in a 
few steps to single fixed point on which the free energy is 
calculated. Therefore, the Bethe approximation provides 
an accurate estimation of free energy much faster than 
other sampling based procedures which are typically slow 
to reach an equilibrium state. 

The stability of Eq. ([3]) can also be studied on sin¬ 
gle instances. Apart from the cavity magnetization, we 
introduce its variance as an extra message denoted by 
The evolution of Vi^a follows: 


V,_ 
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E 

h^di\a 


X [tanh(rh^j) - tanh(r{,^i - 2wbi)f , 


( 5 ) 


where = hb + Gb^i Wbi and Pb^i = 

The stability is measured by the to¬ 
tal variance S{t) = J2{i a) '^i^a(t) summed over all con¬ 
nected pairs (z,a), where t is the iteration step. The 
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FIG. 2: (Color online) Free energy density (/ = F/N) of 
single instances of RBMs. Iterations of the message passing 
equations are always converged to produce the data points. 
The error bars give statistical errors across ten random in¬ 
stances. (a) free energy density as a function of a (density of 
hidden nodes), g = 1.0, and v = 0.05. The inset shows the 
absolute difference, |A/|, of the free energy density estimated 
by the exact enumeration and the Bethe approximation (BA) 
for N = 2Q (see a comparison for ten instances with a = 0.6 
in the inset, where the line indicates equality), (b) free energy 
density as a function of weight strength g and field strength 
V (shown in the inset), a = 0.5. 


explosion of S{t) indicates the instability of the mes¬ 
sage passing equations, which is related to the diver¬ 
gence of the (non-linear) spin glass susceptibility 0 , and 
thus the Bethe approximation becomes inconsistent. We 
study this effect on single instances of RBM as shown in 
Fig.[3l A relative strength is denoted as A = S{t+1)/S{t) 
where t denotes the step at which the iteration converges 
or exceeds a prefixed maximal number (tmax = 500). A 
grows with a and g, and the fluctuation across instances 
becomes strong near to the critical point (A = 1). Note 
that increasing a has an equivalent effect of increasing g. 



FIG. 3: (Golor online) Stability parameter A as a function 
of model parameters. A = 1000, a = 0.5, and v = 0.05. The 
error bars give statistical errors across ten random instances. 
The left inset gives stability versus a with g = 1.0 and v = 
0.05. The right inset gives two examples (black and red) taken 
from the main figure at g = 2.1. 5(0) is the initial total 

variance. 

In the right inset, two typical examples are shown. Near 
to the critical point, some instances have decaying vari¬ 
ance strength (A < 1), while some have growing strength 
(A > 1). Eq. dS]) thus tells us how stable (unstable) the it¬ 
eration of Eq. ([3]) is for a particular RBM. The algorithm 
converges in a few iteration steps to a solution unless the 
recursive process is close to the instability boundary. 

RBM can be used to model the real data, and the 
parameters are fitted to maximize the probability of ob¬ 
serving the training data 0. This would lead to com¬ 
putation of the following quantities, mi = (tTi), rha = 
{tanh{wa ■ cr + ha)) and C'aj = {tanh{wa ■ cr + ha)crj), 
where the average, (•), is taken over the distribution de¬ 
fined in Eq. o, which is intractable without approxi¬ 
mations. Here, mi is the average of visible state at, iha 
is the average of hidden state Sa, and Caj is the correla¬ 
tion between Sa and ai. Although an accurate evaluation 
of the above quantities requires a sufficiently long Gibbs 
sampling (so-called the most difficult negative phase in 
machine learning community i) , we can compute them 
by the message passing equations and then compare the 
results with those obtained by Gibbs sampling to check 
the consistency of the theory. 

Following the same spirit, our theory gives the theo¬ 
retical evaluation of the above quantities as 


tanh ( ^ Ub^i j , 

(6a) 

\ bGdi / 


J Da;tanh(Saa; + Gq), 

(6b) 

rhamj + Waj{l - m‘j)Aa, 

(6c) 
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where Da; = e ^ y/^dx is a Gaussian measure. Aa = 
1-f Dxtanh^(EaX + Ga), Ga = J2k&da'^o-k'mk + ha, and 
Sa - Efcgoa [ 13 . Eq. dH) is computed based 

on the fixed point of the iterative algorithm (Eq. (| 31 )). 

We used alternating Gibbs sampling Q to evaluate 
the equilibrium average of the gradients. More precisely, 
the hidden nodes are updated in parallel according to 
P{sa = 1 |(t) = cosh(i(;a • cr + ha)), while 

the visible nodes are then all updated in parallel ac¬ 
cording to P{ai = l|s) = /{2 cosh{wJ ■ s + (j)i)) 

where Wi is the i-th column of the weight matrix. Note 
that the visible nodes are conditionally independent given 
the hidden states and vise versa Q. These two steps 
of updates form one full step of the alternating Gibbs 
sampling. If this Markov chain is run for a sufficiently 
long time, the stationary (equilibrium) distribution is ex¬ 
pected to be reached, from which the averages can be 
estimated. We test our theory in a system with N = 100 
visible nodes and run the Markov chain with 10® steps 
for thermal equilibration and the other 4 x 10® steps 
to collect a total number of 10® samples to calculate 
the average. We measure the performance by the root- 
mean-square (RMS) error between the Gibbs sampling 
(GS) result and the Bethe approximation (BA) result, 
which is shown in Fig. 01 The RMS error is defined as 

5y = S!=i ~ where Y takes either 

m, rh, or C, and |1^| indicates the number of these pa¬ 
rameters. Small RMS error indicates that the Bethe ap¬ 
proximation is accurate. 

As shown in Fig. 0] (a), all evaluation errors grow with 
the weight strength g, which is reasonable since our mean 
field theory will break down when the network enters a 
strongly correlated state, as already shown by the stabil¬ 
ity analysis. In a similar manner, the error grows with the 
hidden node density a, because each hidden node puts 
a constraint to the network and all constraints compete 
with each other to give an equilibrium state, resulting in 
strong correlations with high a. However, the magnitude 
of all errors is small, implying that one can acquire ac¬ 
curate estimation of gradients of log-likelihood function 
by passing messages on a factor graph as well. We show 
this point more clearly with a scatter plot in Fig. 0] (b) 
(Bethe approximation result versus Gibbs sampling re¬ 
sult). This accuracy is obtained by requiring much fewer 
computational costs compared with the Gibbs sampling. 
The comparison further confirms the efficiency of the pro¬ 
posed mean field method across a wide range of model 
parameters. 

Note that the Gibbs sampling result serves as the 
ground truth here, since we run the Markov chain for 
a long time. More practically, one can estimate the 
statistics by A:-steps contrastive divergence (CD-A:) algo¬ 
rithm [3 that requires a time complexity of 0{kTMN), 
where T denotes the number of sample particles. How¬ 
ever, to reach a similar accuracy as the Bethe approxi- 
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FIG. 4: (Color online) Evaluation performance of mean field 
theory in comparison with the Gibbs sampling. Iterations 
of the recursive equations always converged to produce the 
data points. The error bars give statistical errors across ten 
random instances, (a) RMS error as a function of g. N — 
100, and v = 0.02. (b) Scatter plot for a typical example of 
N = 100, a = 0.5, 3 = 0.1, and v = 0.02. The inset shows an 
example oi g = 0.55 (other parameters do not change). The 
line indicates equality, (c) RMS error &m reached by CD-fc 
as a function of k in comparison to the Bethe approximation. 
N = 100, a = 0.2, g = 0.55, and v = 0.02. The result is 
averaged over five random instances. 
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mation, it typically requires A: > 10 and T ^ 10® under 
the current setup (Fig. |4] (c)). In contrast, the Bethe ap¬ 
proximation yields a time complexity of 0{nMN) with 
n < 100, where n is the number of iterations and one 
iteration involves the update oi MN cavity messages. 

In conclusion, we propose a mean field theory for the 
RBM, a widely used model in machine learning commu¬ 
nity and biological data analysis. The theory captures 
nearest neighbors’ correlations by operating on the cav¬ 
ity factor graph (by removing factor nodes), leading to 
an approximate estimation of the free energy function 
(log-likelihood function) for single instances of large-size 
networks, for which the standard Gibbs sampling proce¬ 
dure becomes prohibitively slow to get a reliable result 
(e.g., for evaluating the log-likelihood function for cross- 
validation analysis). Moreover, we replace the normal 
Gibbs sampling with a mean field computation based on 
message passing algorithm, to estimate the gradients of 
log-likelihood function and show its efficiency by exten¬ 
sive numerical simulations on single instances. The na¬ 
ture of this fast inference lies in the fact that, the informa¬ 
tion is exchanged locally between factor nodes and visible 
nodes, to reach a coherent fixed point, which may pro¬ 
vide a computation paradigm for probabilistic inference 
in neural networks. We expect the mean field theory in¬ 
spired calculation will be useful in practical applications 
and bring more insights to understand the RBM and its 
role in deep learning [18 1. 
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